PALESTINE IN RUINS 
AND IN MASS MEDIA 





ON METHOD 


MANUEL ECHEVERRIA 


On Method is the second part of Palestine in Ruins and in Mass 
Media. It reviews the fixed-point search engine introduced in 
the first paper of the series (Echeverria, 2021). This paper 
confirms that the algorithms employed so far yield suitable 
signatures of the narrative of a given topic. Comparison with 
refinements are implemented with a much larger set, of 733 
articles on the Israel-Palestine conflict, from the biggest 
Swedish media outlets. These refinements deal with networks 
of words. The sample covers May 2021 when violence escalates. 
In summary, both the methods and results from the first paper 
are robust but the refinements are more reliable. 


The second section provides a mathematical formulation of the 
framework. A general setting is considered where a probability 
measure, for the robustness of a given ranking, is derived. This 
scheme deals with configurations of unknown complexity. In 
summary, naive, basic and mechanical rules with plenty of flaws 
can be successfully used to gain relevant information from 
language, good enough for hypothesis testing. Minimal prior 
knowledge about grammar or language is required. Moreover, 
the algorithm allows for arbitrary or inconsistent rules. In the 
end a probability measure reflecting the usefulness of the 
method is given regardless. This paper also gives probabilistic 
accounting perspectives. 
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INTRODUCTION 


This paper is a methodological piece in a series about media and ideology. The 
first part analyses journalism on the Israeli occupation of Palestine when 
violence escalates. It departs from a dataset on Swedish nation-wide outlets; 
both commercial and public service; in print and online. The analysis is mainly 
carried out with algorithms which are scrutinised, and subsequently given 


mathematical formulation in the second. 


Information Production Theory postulates potential for diversity in news 
coverage and opinions, among commercial newspapers, in low-stakes topics. 
This series investigates the extent the Israel-Palestine situation allows for 
journalistic heterogeneity in states which have a distinctively independent, 


neutral or low-stakes stance towards an ongoing conflict. 


Although Sweden officially has a relatively safe position towards the conflict 
in relation to the international community, the first part of the series on 


Palestine showed that Swedish Television (SVT) followed the state line closely. 


Minister of Foreign Affairs made numerous ambiguous statements in the 
calibrated language of diplomacy but was clear on four points. Firstly, the 
state acknowledges the illegality of settlements and evictions in view of 
international law. Secondly, the conflict is not understood in terms of 
legitimate resistance to hostile Israeli occupation. Thirdly, the state 
acknowledges violence on both sides, but depicts Palestinian as indiscriminate 
acts of terror, whereas Israeli violence is seen as legitimate responses against 
specific enemy targets. Finally, insistence on precise Israeli violence results in 


cognitive dissonance, or justification by blaming the victims. 


SVT propagated all these talking points with remarkable vigour and detail but 
deviated in one conspicuous aspect. There was a deviation on the last 


aforementioned key topic, but towards the Swedish conservative/right-wing 


spectrum. Association of Palestinian victims with terror went beyond the 


stance of the state. In conflict with wide-spread claims about left-wing bias, 


the position of SVT resides in the Swedish far right in this regard. These 


findings suggest that the overarching topic is more high-stakes, less neutral 


or more dependent than the safe official 


deliberations are postponed to forthcoming papers. 


DATA 


position reveals. Theoretical 


The main sample of the escalation in May 2021 is from the six biggest nation- 


wide newspapers. Four in print, and the two most successful online editions. 















































Tl Main Dataset: Six Nation-Wide 1-31 May 
Outlet Official Orientation #Items* Edition #Alexa** 
Dagens Nyheter Independent Liberal 114 (126) PRINT 
Svenska Dagbladet | Uncommitted conservative 90 (101) PRINT 
Géteborgs-Posten | Liberal 63 (67) PRINT 
Sydsvenskan Independent Liberal 78 (81) PRINT 
Aftonbladet Social Democrat L773 (173) WEB BUAZ (TH) 
Expressen Uncommitted Liberal 129 (185) WEB 11836 (26) 
* Redundant outliers removed. Initial Retriever dataset in parenthesis. 


** Overall ranking, 


90 days prior 2021-07-31, 


domestic in parenthesis. 


All of these are considered right wing by Swedish standards, except 


Aftonbladet, commonly considered centre-left. The average time spent on the 


sites is 3:04 and 9:16 respectively. 


The initial main dataset consists of 733 articles between 1-31 May. The 
initial dataset was obtained through Retriever Research online database with 
searches on Palestine, Israel, corresponding citizens, and inflections. After 
cleaning data from blanks mostly announcing other items in the set, and 
omitting one irrelevant outlier of five standard deviations, 647 articles remain 
in the operational set. The median number of articles per outlet is 102, 
standard deviation 36. Only texts are analysed in this study. Including titles, 
the average article is 537 words, median 493, standard deviation 400. 
Departures from the Retriever Research figures is due to the superior accuracy 


of the methods in this paper. This is true also before processing, e.g. online. 














T2 Main Dataset: #Words/Item 
Outlet Average Median SDEV 
Dagens Nyheter 608 BLS a3 
Svenska Dagbladet 55 502 358 
Goteborgs-Posten 546 SIL 29) 
Sydsvenskan 484 480 328 
Aftonbladet Boy) 486 433 
Expressen 474 438 302 
Total 537 493 400 








The sizeable spread is affected by the coexistence of shorter news items 
alongside opinion pieces and essays. The dataset has been processed to 
exclude metadata and noise associated with register procedures from the 
source research database. Titles are included to keep the information as close 


as possible to the original publications and perform relevant content analysis. 


1 In addition: A new subset was downloaded with exactly the same keywords. Still, it had less items than it should 
when compared with a set from a month earlier; duplicates are a minor issue; and so is misleading counts. 
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This study also makes use of Retriever Research’s categories, Politics and War 


& Conflict which serves as a suitable point of reference for relevant opinion 


pieces and news items. The union of these consists of 279 (331) items; with 


items of 436 words on average, median 409, and standard deviation 306. 


Multimedia does not crowd out text size for Aftonbladet in this sample (T4). 


T3 Politics, War & Conflict: #Items 














Outlet Official Orientation #Items Edition 
Dagens Nyheter Independent Liberal Some) PRINT 
Svenska Dagbladet | Uncommitted conservative 52 (62) PRINT 
Géteborgs-Posten | Liberal 24 (26) PRINT 
Sydsvenskan Independent Liberal 22 (24) PRINT 
Aftonbladet Social Democrat 46 (49) WEB 
Expressen Uncommitted Liberal 45 (74) WEB 
Swit Officially Independent 55 (58) WEB 














T4 Politics, War & Conflict: #Words/Item 




















Outlet Average Median SDEV 

Dagens Nyheter 542 506.5 401 

Svenska Dagbladet 420 468.5 277 

Goteborgs-Posten Gulls) 622 2,3) |} 

Sydsvenskan 567 S235 269 

Aftonbladet ae TG) 469 348 

Expressen 387 365 343 

save 228 0) 147 

Total 441 414 307 
HYPOTHESES 


H1: Divergence from the state line exceeds pre-emptive openness of less than 
5 % measured in Echeverria (2018; 2020). 


H2: Heterogeneity exceeds Swedish Television along ideological orientation 


as measured in Echeverria (2021). 


H3: Homogeneity within journals, heterogeneity across outlets with different 


ideological orientation. 


Hi is the main theoretically founded hypothesis. All conceivable outcomes are 
theoretically and politically important. H2 is more interesting for political 
reasons, but useful for further theoretical discussions due to the close match 
between the official state line and Swedish Television. H3 is formulated to find 
out more about the specific modes of labour-market discipline and opinion 
management. It is informative of how hiring and assignment strategies are 
arranged. E.g. diversification of opinions within a journal as a way to compete 
for audiences. Specific rhetorical techniques will be discussed in view of 
Echeverria (2018; 2021). 


Please note that from a statistical point of view, it does not matter if the 
hypothesis is rejected, not rejected or even accepted, as long as the procedure 
to do so is correct. Only Hi has been formulated in line with the general thesis 
in this series, as it is derived from rigorous theory. The rest may or may not 


correspond to my personal beliefs of the state of affairs. 


METHOD 


If one would treat the tempting habit to always report support for hypotheses 
at stake as a rhetorical convention, there would be no need to condemn 
pretence in science in this regard. Ultimately, as long as the method is 
transparent or correct, it is potentially informative, and useful for further 
inferences, regardless if it is ad hoc or not. In the end, either a study is 
replicable or not. Either the hypotheses are derived from theory by means of 


valid deduction — and testable in independent studies — or not. 


Unnecessary energy is still soent on condemnation and lengthy treatments of 
this theoretical triviality, as if one spurious study could forever obstruct all 
scientific efforts of a field by exhausting and spoiling all data for ever. The 
habit of discarding theories based on superficial methodological grounds is a 
game which belongs in the realm of politics, not reason. Perhaps it is just a 
yardstick of the prevalent intellectual culture that so many seem to have 
revealed preferences for dumb, deaf and blind scientists incapable of seeing 
patterns in the world around before formally theorising — over sharp intellects, 
constantly forming theories from an acute perception of even the slightest 
trends in everyday flows of data - for obscure philosophical reasons. Certainly, 
this remark exposes the absurdity of mistaking calculated political trickery, for 
honest indignation in the pursuit of truth. Observe it would be irrational or 
blatantly dishonest to play a game of reason by raising aforementioned 


objections in domains where my response appears to be inadequate. 


Thus the need to spend a minimum of effort to settle this antiquated 
methodological issue, and associated petty objections, in a pair of paragraphs. 
Especially in view of the ridiculous political friction, and vicious attempts to 
suppress this otherwise rather conventional media study, on the bloody Israeli 


occupation of Palestine. 


Quantitative analysis of text is supplemented with qualitative. Computations 
are made with specialised software developed by me, most notably a fixed- 
point search engine; or recognised professional to carry out simulations and 
statistical inference. Programs and results have been checked with accurate 


and widely-available software. 


The first part of the study consists of descriptive statistics related to 
hypothesis testing and the nature of journalism. These statistics are based on 
exploration of journalism by means of algorithms, which run through vast 


amounts of data related to the main threads, guided by theory. 


The second part is a technical treatment of some of the methods. For now, 
the complete-study nature of these automated inquiries, will tend to lower the 
reliance on statistical inference, to deduce reliable parameter estimates. But 
as made clear in the section Mathematical Articulation of the Framework, 


statistical reasoning is at the core of the algorithms. 


RESULTS 


The descriptive statistics in this section begin with a fixed-point search engine 
which tracks words most closely associated with a trace, consisting of an initial 
set of keywords, sentence by sentence. This procedure is referred to as a hunt 
or tracking in what follows. The terms directly associated with a trace are 


called direct association. 


Thereafter, the words most commonly related to the direct association within 
the sentences are tracked. The resulting set of words is the equilibrium 
association, which typically approaches overall frequencies of most used 


words (See Echeverria, 2021a). 


The searches or Hunts with the fixed-point engine were found to give suitable 
signatures, succinctly capturing the main topics of a given discussion, when 
evaluated with quantitative and qualitative content analysis (Echeverria, 
2021). This time, the algorithm is examined with a list of alternatives in order 
to evaluate its general ability: In particular ones departing from paragraphs, 


groups of similar words, or both. 


The associations are presented with their respective shares of the top five in 
parenthesis. The main set (647): Terror and associated inflections occur 264 
times in 165 unique items, 0.4 times/item on average. The direct association 


(DA) to terror, for this benchmark algorithm is: 

Hamas(41%)-Israel(25% )—Gaza(12%)-EU(11%)-Terror branded(11%). 
The corresponding equilibrium association is: 

Israel(36% )—Gaza(23%)-Hamas (19%)-Israel (14%)-Israel’s (9%). 


Table 5 compares this benchmark algorithm with refinements, able to account 
for groups of similar words like inflections and composite words. The results 
suggest that the basic algorithm gives a remarkably accurate account of the 
rankings when compared to more advanced procedures in the main dataset. 
For the latter, each word in the heading is the leader in a network of similar 
words (subjects). For instance, terror is merely one word with 8 counts, and 
terror branded is an independent word with a count of 44 in top 5 in the 
benchmark. In the refinements, introduced in this paper, terror is the leader 


of a network with a count of 264 and 56 subjects, terror branded among these. 


A random sample of 10 translated words from the group with the leader terror 
is: terror organisation, terror act, terror actions, terrorism, IS-terrorist, terror 


tunnels, terror groups, terror movements, terror branded, terrorise. 


See Mathematical Articulation of the Framework. 


T5 Direct Associations by Algorithm Specification 

















Top 5 BC|NC* TERROR** ISRAEL HAMAS GAZA ORGANISATION TOTAL 
#Words Top 5 264 200 LVS 80 60 782 
Real Prob. NC (a) 36% 27% 245 11% 8% 742 
Virtual NC (b) 34% 26% 23% 10% 8% 782 
#Words Top 2-6 ISRAEL HAMAS GAZA ORGANISATION ISLAM TOTAL 
Virtual NC (c) 35% 31% 14% 10% 10% DIS 
#Words Top 2-6 ISRAEL HAMAS GAZA ISLAM EU 
Intuitive NC (d) 3D 31% 14% 10% 10% 6.7) 
#Words Top 5 (BC) HAMAS ISRAEL GAZA TERROR* * EU 
Basic Algorithm 41% 255 12% Lis 11% 408 























* BC = Basic/Benchmark Algorithm: whole words. NC = Network Algorithm: word groups. 
**Terror is a group for refinements, but the word Terror-Branded for benchmark cat. 
(a) Probability(A or E) must subtract 5% intersection of Terror & Organisation (40) 
(b) Double count allowed. Shares of total Virtual - Thus proper 'double count' avoided 
(c) Intersection of just 1. Thus almost coincides completely with real probabilities. 
(d) Organisation dissolved (weaker bargaining position): 2/3 of its counts also in 
Terror. Former drops below top 5 when these are removed, while Terror prevails. 


















































All algorithms in the table track sentences. T5 gives an account of top 5 and 
top 2-6 because the benchmark algorithm departs from the same 
encompassing 264-hit terror-keyword, but only seeks whole words thereafter. 


Thus, it usually gives info on 1+5 top terms. 


Virtual shares are in important regards more real than shares of a total without 
double counts, of course. Real Probabilities can be thought of as proper shares 


obeying the appropriate concept of probability. See Appendix (A1) for details. 
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A parsimonious way of grasping the robustness of a given ranking or order is 
naturally achieved by counting the number of groups it takes to beat the 
lowest member of the order, i.e. its weakest link. In a non-stochastic setting, 
the convention can e.g. be to start with those outsiders with greatest counts 


adjacent to the weakest link, and check how many it takes. 


A measure of robustness of an order towards outsider constructs of unknown 


complexity is given by: 


The probability a sample, of randomly selected groups of outsiders, has a 


combined count greater or equal than the weakest link. 


This probability is informative about the remaining combinations, potentially 
merged in ways, unaccounted by the simplest grouping and ranking rules of 
an algorithm. The number of merged groups it takes to challenge status quo, 
i.e. be on par or beat the weakest link, is the breakthrough number. A count 


equal or above is a breakthrough; the successful coalition is the challenger. 


In other words, the probability of a challenger with breakthrough number (of 
at most) B, given information about the outsiders, is the generalization of the 
natural idea. The probability of a challenger with breakthrough number seven 


is 5.0 % without further information. 
See Mathematical Articulation of the Framework for details. 


Table 6 compares the distribution of frequencies for three informative words, 
and their inflections, with Swedish Television (May 7-16) in the first paper of 
this series. 
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T6 Relative Frequencies Part I & II 























TERROR PALESTINE OCCUPATION TOTAL 
PAPER I (a) 52% SS 1% a) 
SHARE II (b) 27% 54% 19% 981 
SHARE II (c) 23% 55% ZA 433 
(a) Share Paper I, May 1-16 (b) May 1-31 (c) May 7-16 





Clearly, the habit of using the term Palestine is much greater in commercial 
media, in relation to the other two. Moreover, there is a slightly higher 
propensity to use the word occupation. Due to the accuracy of the database, 


this latter discrepancy is also to be considered as a reliable result. 


Table 7 displays the ratio of the number of articles containing Occupation over 


the number of articles containing Terror. 


T7 #Items with Keyword [Occupation/Terror] 


PAPER I RATIO II (a) RATIO II (b) 








Occupation/Terror 50% 64% 82% 





























(a) RATIO : MAY 1-31 (b) RATIO : MAY 7-16 














The tendency in table 6 is also carried over to the number of unique articles 
mentioning occupation or terror respectively. Interestingly enough, 
occupation is mentioned even more frequently, in relation to terror, in the first 


half of the period in the commercial newspapers. 
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POLITICS, WAR & CONFLICT (PWC) 


In the Politics, War & Conflict subset - 279 items after processing - terror 
occurs 144 times, in 93 unique items, about 0.5 times/item on average. The 


direct association has most weight on terror-branded Hamas: 
Hamas(40%)-Israel(23%)-Terror branded(13% )-Gaza(12%)-EU(12%). 
The equilibrium association is: 

Israel(35% )-—Gaza(21% )—Hamas(20%)-Israeli (14%)-Israel’s(10%). 


Table 8 shows the direct association May 1-16 to compare with paper 1. The 
similarity is striking. Moreover, the basic algorithm and the sentence-based 
refinement, coincide almost completely. However, terror (branded) is just 
outside top 5 of the basic algorithm overall, due to a greater variety of 


inflections employed in this sample. 


T8 DA by Algorithm Specification 












































Algorithm BC|NC* Politics, War & Conflict | Top Distribution May 1-16 
NC All Terror (30%) -Israel (26%) -—Hamas (23%) —Gaza (13%) —EU (8%) 
BC All Hamas (37%) -Israel (26%) -Gaza Strip (14%) —Eu (12%) -Rockets (12%) 
NC SVT2 Terror (31%) -Hamas (25%) -Israel (21%) -Gaza(13%)- Rockets (11%) 
BC SVT2 Hamas (31%) -—Terror (30%) -Rockets (15%) -—Gaza (15%) -Several (9%) 
BC SVT1 Hamas (32%) -Terror (32%) -Israel’s (12%) -Gaza (12%) -Rockets (12%) 
*BC = Basic/Benchmark Algorithm; NC = Refined/Network Algorithm. SVT1&2 are results for state 


television in paper 1 and this paper respectively. Terror is as usual a group for refinements, 





but the word Terror Branded in basic cats. 


The two first rows are based on the dataset of Politics, War & Conflict including 
Swedish television (SVT). These are presented in compact molecular form but 
computed as in T5. Additional robustness checks of T6-T8 shows that the 
period May 1-16 yields virtually the same results as May 7-16. 
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BREAK POST BOMBING AP & AL JAZEERA? 


The responsible thing to do is to check if there is a break before and after the 
May 15 Israeli bombing of the offices of Associated Press and Al Jazeera. 
Therefore the first and second half of May are contrasted. Commercial and 
state television are treated separately within the Politics, War & Conflict 


dataset in order to improve commensurability. In summary: 


Direct-Association Signature: When SVT is removed, the overall narrative 


on terror is similar, but the direct associations indicate commercial 
newspapers tell a slightly different story. Both sets of associations seem to be 
related to a progression from a more particular account of the state of affairs, 
to a more general in the second half. Words in the first half are somewhat 
more related to particular descriptions of the conflict on the ground, and ina 
specific regions. However, the narrative is dictated by the antagonism 


between Terror-branded Hamas that rules Gaza and Israel. 


T9 Israeli Bombing of Press Offices 









































Algorithm BC|NC* Politics, War & Conflict | Top Distribution 
NC 1-31 Terror (31%) -Israel (26%) —Hamas (22%) —Gaza (13%) —EU (7%) 
BC 1-31 Hamas (39%) -Israel (24%) - Eu(13%)- Gaza(12%)-Terror Branded (12%) 
NC 1-15 Terror (28%) —Israel (28%) —Hamas (21%) —Gaza (14%) —Rule (9%) 
BC 1-15 Hamas (32%) -Israel (27%) -Gaza Strip (17%) -Rockets (12%) -Terror Branded (12%) 
NC 16-31 Terror (33%) -Israel (24%) -Hamas (23%) —Gaza (11%) -USA (9%) 
BC 16-31 Hamas (41%) - Israel (18%) -Gaza (15%) -USA (13%) -EU (123) 


























* BC = Basic/Benchmark Algorithm; NC = Refined/Network Algorithm 


(Virtual) as in Table 8. 





In view of these figures, the following is conjectured: More generic names of 
places and actors associated with the conflict are prioritised in the second half 


of the month. 
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Commercial outlets focus more on international reactions instead of Hamas 
rockets, especially in the second half. Swedish television maintains focus on 


rockets but with international reactions. 


Individual Searches: The trends remain similar but with relative less weight 


on occupation. SVT increases its relative weight on terror. 


PARAGRAPHS 


A new dataset with intact paragraph structure, in Politics, War & Conflict, 
was downloaded about one month after the first. Although it was downloaded 
with exactly the same keywords, the raw file on commercial outlets 
nevertheless contains 19 items less than it should — 183 after processing. The 
processed files differ with 44. Thus, this set is still suitable for comparing the 


algorithm but somewhat less reliable in terms of external validity. 


T10 Summary Statistics 15* & 2"4 Download #Words/Item 


Politics, War & Conflict - Commercial Outlets 








Dataset Average Median SDEV 
PWCC I (227) 493 479 De Ih 
PWCC II (183) 514 495 313 














As is evident form T11, the algorithms tracking words in paragraphs instead 
of sentences yield essentially the same results. This is especially the case for 


the network-based algorithm. 
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T1l1l Basic, Network and Paragraph Algorithm 


Algorithm: BC|PC|PN* | Politics, War & Conflict II | Top Distribution 








Basic (BC) 1-31 Hamas (40%) -Israel (21%) -EU (17%) -Gaza (12%) -USA (10%) 














Paragraph (PC) 1-31 Hamas (39%) -Israel (25%) - Gaza(14%) - EU(11%) -Israeli (10%) 
























































Network (PN) 1-31 Israel (29%) -Terror (26%) —Hamas (23%) —Gaza (13%) —Rocket (9%) 

Sentence (NC) Terror (37%) -Israel (22%) -Hamas (20%) -Gaza (12%) —-EU (9%) 

BC = Basic Algorithm: Whole words. PC = BC with Counts on paragraphs. PN = Network version 
of PC. NC = Sentence-based Network algorithm, as described in T5, for reference. 


CHECKING ALGORITHM INTERPRETATION 


So far, interpretations have only been based on the concentration of 
molecules, sometimes with knowledge about the groups of their components. 


Although natural, conjectures nevertheless. 


RESULTS 


As expected, the interpretations employed in this paper closely correspond to 
the content — both when the particular sentences counts are based on were 
checked - and in more comprehensive random samples of the articles. The 
logical intuition is that when a sample has been conditioned with a search on 
a relevant trace guided by theory, other relevant terms will tend to be 
attracted as well. The most objectionable conjectures are about the connection 


between Hamas, terror and rule, these will therefore be checked. 


Sentences: Qualitative content analysis reveals that all instances of Hamas 
are directly connected to terror. Thus, the coexistence of the two terms is not 
a coincidence - Hamas is presented as synonymous with a ‘terror 
organisation’. The term rule was as expected connected to Hamas, the rulers 
of Gaza. Turns out that this conjunction exists in 90 % of the occurrences of 


the word ru/e and similar inflections. (See table 9) 
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Random Sample on whole Articles: The same thing is true when a more 
comprehensive analysis based on a random sample of 20 items in the first half 
was Carried out. This time, 11 were related to ru/e proper, and 90% (10/11) 
to Hamas as above. This shows that the same interpretation is valid when 


whole articles are checked at random. 


More generic narrative. It was conjectured that the second half of the 
month had a more general discussion with international actors. There is truth 
in that conjecture when relevant sentences were checked. However, the 
international actors are invoked to underscore the terrorist status of Hamas. 
USA and EU are more frequently cited, in the second half of the month, but 


as authorities that have branded Hamas as a terrorist organisation. 


In conclusion, the perhaps most natural interpretations that can be made at 
first glance of the molecular composition corresponds to the actual content of 
the associated sentences of the tracking procedure. More comprehensive 
random samples of articles yield similar results. However, it is important to 
bear in mind that a carefully chosen initial trace will focus the results on 
relevant topics, which will not necessarily hold true on the aggregate. This 
result is notwithstanding reflected in the discrepancy between direct 


association and equilibrium association. 


SUMMARY 


Comparisons of the basic fixed-point engine with refinements show it is a 
suitable summary statistic, remarkably close to more intricate alternatives 
throughout the study. The main difference is that the refinements rank words 
in groups of a wide range of inflections and composite words. Thus, there is a 
methodological continuity between the first paper and this one in spite of the 
changes. 
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There are starker differences in journalistic output. Succinct descriptive 
statistics generated by the algorithms are so far consistent with the general 
hypothesis of a more diverse commercial press, when comparing 


commensurable periods. However, this will be settled in forthcoming work. 


The individual searches on Terror, Palestine, and Occupation made in the first 
paper also differ in the same direction. The relative weights between the two 
first are switched, and the overall distribution between the three is more 
balanced. Palestine is now much more frequently used than Terror, in turn 


now much closer to the frequency of Occupation. 


Contrasting the first and second half of the month adds to the impression of 
a more independent journalism compared to SVT, or a lag between state 
narrative and diffusion in commercial outlets. The difference in emphasis on 
terror increases in the second half of the period. The commercial outlets have 
less immediate connection to the initial agenda-setting statements of Minister 


of Foreign Affairs in this particular regard. 


This continuity has traces of self-similarity in terms of most frequently used 
words. Especially at the level of sentences and paragraphs, but perhaps less 
so at the level of whole articles and aggregate journalistic output. Exact 
degree of self-similarity, and the underlying behavioural dynamics generating 


it, is a topic worthy of more attention in the future. 


Earlier discovery and analysis of exact propaganda fractals in journalism 
(Echeverria 2018, pp.166-174) showed that these were directly related to 
repetition of falsehoods and dogmatism. In view of indications of more diverse 
journalism, other causes to self-similarity in topic distribution must also be 
taken into consideration if significant. 


18 


MATHMATICAL ARTICULATION OF THE FRAMEWORK 


This paper considers operations on information structured as a language. The 


approach in this paper deals with the following conditions or problems: 


(1) Convoluted or idiosyncratic information. Expressions may be 
created with inconsistent rules. 

(2) Limited knowledge & autarky. Operations must work without 
complete information about the language, preferably without additional 
information to the dataset under study. (no complete dictionary or 
grammar-book condition) 

(3) Transparency requirement. The algorithms employed must allow for 


low-cost scrutiny, and preferably be simple or tractable. 


This setting opens up for serious challenges, for example: Convoluted 
information and limited knowledge (1 & 2) allow for expressions of unknown 
complexity which nevertheless must be accounted for. Transparency and 
autarky (2 & 3) limits the scope of at least a subset of machine-learning 
procedures which have been trained on large datasets exogenous to the 
dataset to be researched (2), or may yield good results but with virtually 
intractable rules (3). Low-cost scrutiny may e.g. be defined as an amount of 
effort/information processing to achieve transparency, below a certain share 


of what is required to compute the operation on the dataset. 


This setting is theoretically motivated in terms of the comprehensive scope 
(1), or with regard to common-knowledge requisites for scientific inquiry (2 
or 3). There are also much more mundane reasons, for example: State 
oppression and hacking may perturb incentives and the marginal propensity 
to be online to acquire databases. Hardware or other limitations may make it 
more efficient to program a tailor-made solution than to incur an economic or 
human-capital investment in alternative technology. In other words, this 


setting corresponds to an outsider or underprivileged person’s situation. 
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PROBABILISTIC APPROACH 


The operations considered here are related to the frequency distribution of 
expressions, and ranking of these accordingly. Depending on convention the 


scheme can be described in two or three steps. 


(1) Form groups of expressions 
(ii) © Rank groups formed in (i) 


(iii) Compute robustness of rank formed in (ii) 


Once the first two steps have been computed, the groups simply have been 
ranked in accordance with the sum total of the frequencies of each word in 
the group. Consistent with requirements (1-3), only a horizon of a couple of 
groups at the top are properly checked in step (i & ii), the rest are just the 
results of a simple algorithm. Such a list will therefore contain information 
about the distribution of counts by groups, and potentially more elaborated 
processed information about the top. The problem is that although the top (or 
order) may be satisfactory, partially by invoking arbitrary rules in addition to 
the simplest algorithm, there is no way of knowing if the simplest of rules 
have ordered the rest intelligibly, and there are no reasons to conclude such 


state of the world is the case for sure (1-3). 


Nevertheless, the distribution of groups makes it possible to account for 
network-configurations of unknown complexity by counting the number of 
groups it takes to beat the lowest member at the top, i.e. the weakest link of 


the order. A generalisation of this natural idea is: 


The probability a sample, of randomly selected groups of outsiders, has a 


combined count greater or equal than the weakest link. 


A configuration of groups not in the order (outsiders), which is greater than 
the weakest link is called a challenger. The size of the random sample is 


denoted B, and is called the breakthrough number. 
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The ranking can be understood as a list containing frequencies of group sizes. 
The latter are in the set {1, 2, ..., fmax}, where rmax is the heaviest group of 
the order in terms of counts — not number of subjects in the groups. Therefore, 
the probability of finding a challenger as a result of a random sample, of B 
drawings without replacement, is Multivariate-Hypergeometrically distributed 
(MVH). The probability of a challenger with breakthrough number B, with a 


weakest link of count W is denoted: 
Fl P(xr > W|I), xe MVH(B, g) 


Where g is a vector with the distribution of group sizes in terms of counts - 
i.e. the frequencies of distinct group sizes in the sample; x is the vector of 
groups, r is the vector with corresponding counts for the groups in x. The dot 
product xr is required to be such that each group variable is multiplied with 
its corresponding count. Furthermore, I is conditional info. E.g. a logical circuit 
over information on incompatibility of groups in the vicinity of the order, 
retrieved from the horizon in the previous steps. Such condition could lower 
the probability of a challenger. Useful information can be extracted from the 
graph-matrix containing vectors of connections for each word. The length of 
such truncated or directed vectors, informs about the degree distribution of 
connections (see next section). Extreme outliers can be used to rule out too 
large groups in terms of subjects, which together with the information from 


the top, narrows down or maps out the domain for challengers. 


In practice a 5-standard-deviation threshold identifies expressions which 
always have been ruled out so far, usually mistakes or with three or less 
characters. In conclusion, this approach and the formula pins down what is to 
be regarded as relevant exogenous information about language in this setting, 
but much can be deduced or at least plausibly conjectured from the dataset 
without additional info. Although this remark also underscores that F1 is based 


on a useful model, not ‘reality’, this point cannot be emphasised enough. 
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GROUP FORMATION 


From (2) there is at most an incomplete dictionary and grammar. The sample 
consist of about 700 items, roughly 390k words and 2.25 million characters in 
total. To speed up research, a list of less than 600 words, 60% between three 
and six characters, with about 2,7k characters in total was used to delete 
noise. This list contains connectives, the alphabet, digits etc. and is more than 
enough to produce the results in the first part. If such list is feasible, and there 
is no theoretical interest in the viability of full autarky, skip this section and 


go to The Graph of Connections. 


Exogenous information in aforementioned list can be drastically reduced with 
a partial model of what meaningful information or a language is. But in the 
end, even if this model is flawed, the general scheme of the previous section 
computes a summary probability measure for the reliability regardless. This is 
true by first principles. Hopefully the reader will be able to immediately think 
of tens of exceptions to these simple rules as he or she goes along, and will 
be able to write down at least a hundred or so within minutes. As all working 
hypotheses, the actual usefulness must be checked empirically. The results in 


the first part of this paper show that a few simple rules are remarkably useful. 


Information is considered as generated by a set of atoms. A sequence of atoms 
is information. In particular: ordered subsequences of atoms are delimited by 
markers; these units are called molecules, including composites. Sequences 
of molecules make up substance. Substance is generated by a configuration 


of rules or markers, possibly with unknown or inconsistent rules. 


The point of departure is a dataset to be researched, e.g. the one explored in 
this paper. This set is for purpose of exposition regarded as representative. If 
the sample was complete, then there would be a full dictionary. When all the 
rules or conventions necessary to form substance are known, then there is a 


full grammar. Molecules identified in the sample are facts, and (2) holds. 
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There is a basic assumption: Facts are likely to be useful to express 
something and can therefore be used to inquire without knowing their exact 
meaning. The operations to find out about implied grammar on this set is by 
and large restricted to intersections and complements between molecules. The 
ordered sequence of atoms is of interest. If subsequences between two 
molecules coincide, i.e. they intersect in terms of an ordered sequence of 


common atoms, then they have a common denominator. 


A related conjecture is that similar molecules belong together in terms of 
expressions. In other words, the conjectured existence of variations of 
meaningful molecules generated by the underlying rules. Variations or 
inflections are (to some extent) identified by the atomic complement of two 
molecules. In these cases, the common denominator is a fact - but the atomic 
complement is not. Conjectures can be made based on efficiency requirements 
or statistical inference. E.g. the complement is over a certain fraction of the 
factual molecule. In particular if it is heavier/longer than the factual molecule 


it would tend to be less likely as an attachment to denote inflection. 


If the complement is factual molecules or factual molecules with inferred 
attachments, it is a composite. If the grammar is somewhat consistent, the 
set of attachments in the inflection complement will tend to be recurrent, and 
if efficiency is assumed also limited. Under these circumstances, the set of 


inflection complement corresponds to a residual or rest of attachments. 


If a complement degenerates such that it is neither a fact proper, inflection, 


or a fact with attachment, then it is strange, i.e. rubbish. 


In summary, complements are employed to pin down viable related groups 
of molecules by checking whether intersecting expressions are composites, 
inflections or strange. This process of joining and deletion is like a Tetris game 
with a probability of meaningful or unrelated connections, where the former 


is conjectured to be greater than the latter for valid groups. 
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THE GRAPH OF CONNECTIONS 


The graph of connections between words with common denominators is 
central and will be derived in this section. Connections in the graph are made 
up by common denominators. To get an idea of its structure and purpose, 
note that lighter/shorter molecules will tend to have plenty of common 
denominators/connections to other molecules. Moreover, for a sizeable share 
of their connections, the intersections will fully cover them (complete). For 
now, a complete common denominator is denoted by one for the smallest, 
and those between zero and one are truncated to zero. When this information 


is stored in a matrix, it results in a directed graph of connections. 


More precisely: For purpose of exposition, consider molecules as ordered 
sets, or tuples of atoms <ai, a2, ..., An>. This can also be seen as a set with 
indexed elements, where the index or coordinates denote position of the 
elements proper. An overlap of threshold n is an operation Oj; over ordered 
sequences of atoms which belong to molecule mi, mj € Q respectively; where 
Q is the set of all molecules. In a more general setting, ni is a threshold value 
for the size of an intersection between any ordered subsequence of a molecule 


i, with another. The index of n will be supressed lest it is needed. 


For now, it is enough to pay attention to the case where n is the number of 
atoms of mi to be selected from the first element of mi up to the atom with 
index n. This sequence, 07, is then compared in n steps, element by element, 
to ordered sequences in mj. If the elements proper are not equal for each of 
the n steps, then an overlap is not returned (necessary condition). For purpose 
of exposition, consider the special case when the operation starts from the 
elements with lowest index and takes steps in increasing direction, and only 


the first overlap is returned, which is denoted 0;;. (*) 
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Consider the corresponding weights wy= a! and wyi= mal where |mx]| is the 


|mj| |mj| 
total number of elements in molecule k, or largest index. Note that |0j|= n if 
overlap exists, 0 otherwise. In general,|0j| & |0;;| could both be positive and 
differ in various ways depending on n, direction etc. In the special case (*), 


with nk=|mx| a heavier molecule is not overlapped by a smaller. Therefore: 


i= 1 if OF; #0 

60 otherwise 

A connection is formed when a molecule is smaller and overlaps, in which case 
its weight has the value one, and zero otherwise. In more general cases, wij 
potentially is at least a triple with information on whether the intersection is 
above threshold or not (1 or 0), and the relative size of the molecules in an 
intermediate case, e.g. when wj is smaller. This also is informative about the 
size of the overlap. The connections for each molecule to others are collected 
in vectors c;. The matrix M of such row vectors is the graph of the network of 
connection. The number of positive elements in ciis the number of connections 
of mi which can be used to spot statistical anomalies given the empirical 
distribution. Note however that when the set is restricted to a given topic, the 


keyword corresponding to that topic will tend to also be an anomaly. 


Basic groups are formed on the set F of facts, which clearly usually yields a 
much smaller number of molecules to handle than the /ist of facts. The row 
vectors in M correspond to candidate groups with subjects connected to a 
group leader, when mapped to the set of molecule tuples. The set of atomic 
complements between the leader and the subjects Cj; is the corresponding 
operation to Oj} with the same specifications as above but returns the 
remainder of mj; when the intersection is subtracted. If the atomic complement 
is rubbish, it is thrown away (see previous section for definition). In practice, 
groups with anomalous lightweight leaders with many connections but few 
independent counts, are usually dissolved. 
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This deliberately simplistic procedure yields remarkably sensible groups. The 
ranking is made by counting each unique molecule in F, in the dataset of 
interest, with an algorithm which counts networks of words corresponding to 
the row vectors above. This is what most word-processors do when they 
highlight your searches on a document. Only a horizon of an order of size o 
plus a number of potential outsiders adjacent to the order is treated, until the 
order is stable for this horizon, with the aforementioned basic-group 
procedure. This ranking can be reduced by removing inflections from the 


horizon, but I leave the details on this to the reader. 


GENERAL REMARKS 


There is a rather obvious intuition associated with social science and other 
domains but there is more to it than parables. A more subtle point is the 


parsimonious conceptualisation of general conditions shaping institutions: 


a) Multiplicity of rules, and distribution of knowledge about theses b) how 
information interacts with what is to be or not to be regarded as subjective d) 


how to deal with these characteristics when rules also are inconsistent. 


Note that this framework not only allows complicated atomic constructs both 
at the level of molecules and substance. Complexity is also allowed in terms 


of rules of inference or grammar, and these claims have been substantiated. 


Empirical remark: Relatives to the geometric distribution are evident at all 
levels, from word-frequencies in the networks of molecules, to aggregate 


journalistic output. 
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APPENDIX 


(A)COUNTING CONVENTIONS 

Probabilistic perspectives on accounting are given througout this paper in line 
with Echeverria (2020). This section promotes transparency with a rationale 
and a straightforward example for the counting conventions. In particular to 
virtual or double counts, real counts and probabilities in T5. First row of Al 
gives the basic counts of each category, A&B shows that category A and B 


have two elements in common. 


Al Double Counts 









































REAL DOUBLE 
' - c nee TOTAL COUNT 
COUNTS rl rl 12 2 18 20 
VIRTUAL SHARE | 0,2 0,2 0,6 0,1 1 
REAL ‘SHARE’ | 0,222 0,222 0,667 0,111 ‘pslaise 
A B C A&B AVBVC 
PROB On zoo mon 22 Ono Gy Gere 1 
AVB AVC BVC A|B 
0,333 0,889 _—0,889 0,5 








Although some would intuitively prefer a Rea/ Count, which avoids double 
counts, over Virtual Count, which is based on double counts, the latter is on 
all accounts more natural than the former. The resason is that simple 
substraction of the conjunction will not do. It makes no sense because the 
groups cannot both not contain the intersection, and it is potentially arbitrary 
if one does but not the other, and erroneous if both do because it would then 
be double count. As the table shows, shares or a probability measure based 
on substracted total (18) is degenerate. In contrast, virtual shares do not alter 
the underlying numbers and the shares or probability measures have the usual 
interpretations because the doubles are considered as different (virtual) 
entities. A proability concept to accounting is the correct approach as 


employed in the row ‘PROB’ if Virtua/ Counts are disregarded for some reason. 
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BASIC & REFINED RANKING 


For reference, the following graph on the most frequently used words in the 
subset Politics, War & Conflict, was created from a random sample of articles, 
half the size of the full set (n = 140). 


Top-10 Words 
Politics, War & Conflict 


Frequency 


1500} 


1000} 


500+ 








It gives a fair picture of the actual distribution, and the usefulness of the basic 
grouping algorithm, which is an improvement but has limitations. There is one 
exception to the rule. Pa/estinian and Palestine and related inflections should 
arguably be in a group on second place. However as their inflections do not 


have a common (empirical) denominator, they are treated as different groups. 
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The chart above is much more informative in terms of content compared to 
one based on whole words as below. Observe that the tracking algorithms are 


much closer to each other, with or without refinements. 


Top 10 Words 
Frequency 
2500 
2000: 
1500} 


1000/ 


500} 








Israel Israeli 
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